Abstract

In this article supervised learning problems are solved using soft rule 
ensembles. We first review the importance sampling learning ensembles 
(ISLE) approach that is useful for generating hard rules. The soft rules 
are obtained with logistic regression using the corresponding hard rules. 
Soft rules are useful when both the response and the input variables are 
continuous because the soft rules provide smooth transitions around the 
boundaries of a hard rule. Various examples and simulation results show 
that soft rule ensembles can improve predictive performance over hard 
rule ensembles. 



1 Introduction 

Ensemble learning ([19], [15], [22]), a relatively new approach to modeling
data, challenges monist views by providing solutions to complex problems
simultaneously from a number of models. By focusing on regularities
and stable common behavior, ensemble modeling approaches provide solutions
that as a whole outperform the single models. Some influential early works in
ensemble learning were by Breiman with Bagging (bootstrap aggregating) ([3]),
and Freund and Schapire with AdaBoost ([8]). All of these methods involve
randomly sampling the "space of models" to produce an ensemble of models.

Although not necessary, in practice ensemble models are usually used with
regression / classification trees or binary rules extracted from them, which are
discontinuous and piecewise constant. In order to approximate a smooth
response variable, a large number of trees or rules with many splits are needed.
This causes data fragmentation for high dimensional problems where the data is
already sparse. The soft rule ensembles proposed in this paper attack
this problem by replacing the hard rules with smooth functions.

In the rest of this section, we review the ensemble model generation and 
post-processing approach due to Popescu & Friedman ([H]). Their approach 
attempts to unify ensemble learning methods. In the next section, we explain 
how to convert hard rules to soft rules using bias corrected logistic regression. 
In Section 3, we compare soft and hard rule ensembles on several examples.
The paper concludes with our comments and discussions.

Suppose we are asked to predict the continuous outcome variable y from a
p-vector of input variables x. We restrict the prediction models to the model
family $\mathcal{F} = \{f(x; \theta) : \theta \in \Theta\}$. The models considered
by the ISLE framework have an additive expansion of the form:

$$F(x) = w_0 + \sum_{j=1}^{M} w_j f(x, \theta_j) \qquad (1)$$

where $\{f(x, \theta_j)\}_{j=1}^{M}$ are base learners selected from $\mathcal{F}$.
Popescu & Friedman's ISLE approach ([H]) uses a heuristic two-step approach to
arrive at F(x). The first step involves sampling the space of possible models to
obtain $\{\theta_j\}_{j=1}^{M}$. The models in the model family $\mathcal{F}$ are
sampled using perturbation sampling: by varying case weights, data values,
variable subsets, or partitions of the input space ([27]). The second step
combines the predictions from these models by choosing the weights
$\{w_j\}_{j=0}^{M}$ in (1).

The pseudo code to produce M models $\{f(x, \theta_j)\}_{j=1}^{M}$ under the ISLE
framework is given below:

Algorithm 1.1: ISLE(M, $\nu$, $\eta$)

    $F_0(x) = 0$
    for $j = 1$ to $M$ do
        $(c_j, \theta_j) = \arg\min_{(c, \theta)} \sum_{i \in S_j(\eta)} L(y_i, F_{j-1}(x_i) + c f(x_i, \theta))$
        $T_j(x) = f(x, \theta_j)$
        $F_j(x) = F_{j-1}(x) + \nu c_j T_j(x)$
    return $\{T_j(x)\}_{j=1}^{M}$ and $F_M(x)$

Here, $L(\cdot, \cdot)$ is a loss function, $S_j(\eta)$ is a subset of the indices
$\{1, 2, \ldots, n\}$ chosen by a sampling scheme $\eta$, and $0 \le \nu \le 1$ is a
memory parameter.
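
The generation loop can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: it assumes squared-error loss, a single input variable, regression stumps as base learners (so the scale factor c is absorbed into the leaf means), and subsampling without replacement as the sampling scheme. The names `fit_stump` and `isle` are hypothetical.

```python
import random

def fit_stump(xs, ys, idx):
    """Least-squares regression stump on one input, fitted on the rows in idx."""
    best = None
    for split in sorted({xs[i] for i in idx}):
        left = [ys[i] for i in idx if xs[i] <= split]
        right = [ys[i] for i in idx if xs[i] > split]
        if not right:              # the largest value leaves nothing on the right
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, split, ml, mr)
    if best is None:               # all sampled inputs identical: predict the mean
        m = sum(ys[i] for i in idx) / len(idx)
        return lambda x: m
    _, split, ml, mr = best
    return lambda x: ml if x <= split else mr

def isle(xs, ys, M=50, nu=0.1, eta=0.5, seed=0):
    """Return the M base learners T_j and the fitted values F_M(x_i)."""
    rng = random.Random(seed)
    n = len(xs)
    F = [0.0] * n                                          # F_0(x) = 0
    learners = []
    for _ in range(M):
        idx = rng.sample(range(n), max(2, int(eta * n)))   # S_j(eta)
        resid = [ys[i] - F[i] for i in range(n)]           # squared loss: fit residuals
        T = fit_stump(xs, resid, idx)
        learners.append(T)
        F = [F[i] + nu * T(xs[i]) for i in range(n)]       # memory parameter nu
    return learners, F
```

With `nu=0.0` every learner is fitted to the raw response on its own subsample, which corresponds to a bagging-like ensemble; with `nu>0` each learner sees what the previous ones left unexplained.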

The classic ensemble methods of Bagging, Random Forest, AdaBoost, and
Gradient Boosting are special cases of the generic ensemble generation procedure
([27]). The weights $\{w_j\}_{j=0}^{M}$ can be selected in a number of ways. For
Bagging and Random Forests these weights are set to predetermined values, i.e.
$w_0 = 0$ and $w_j = 1/M$ for $j = 1, 2, \ldots, M$. Boosting calculates these
weights in a stagewise fashion: it uses a positive memory parameter $\nu$,
estimates $c_j$ at each step, and takes $F_M(x)$ as the final prediction model.

Friedman & Popescu ([H]) recommend learning the weights $\{w_j\}_{j=0}^{M}$ using
LASSO (US]). Let $T = (T_m(x_i))_{i=1,\,m=1}^{n,\,M}$ be the $n \times M$ matrix of
predictions for the n observations by the M models in an ensemble. The weights
$(w_0, w = \{w_m\}_{m=1}^{M})$ are obtained from

$$(\hat{w}_0, \hat{w}) = \arg\min_{w_0,\,w} \; (y - w_0 1_n - Tw)'(y - w_0 1_n - Tw) + \lambda \sum_{m=1}^{M} |w_m|. \qquad (2)$$

$\lambda > 0$ is the shrinkage parameter; larger values of $\lambda$ decrease the
number of models included in the final prediction model. The final ensemble model
is given by $F(x) = \hat{w}_0 + \sum_{m=1}^{M} \hat{w}_m T_m(x)$.
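
To make the post-processing step concrete, the following is a minimal pure-Python coordinate descent sketch for the penalized least squares criterion above, with an unpenalized intercept. It is an illustration only, not the solver used in the paper: the function names are hypothetical, the penalty is applied exactly as written in the criterion (no 1/(2n) scaling), and a fixed number of sweeps replaces a proper convergence check.

```python
def soft_threshold(z, g):
    """Soft-thresholding operator S(z, g) = sign(z) * max(|z| - g, 0)."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(T, y, lam, n_iter=200):
    """Minimise (y - w0 - T w)'(y - w0 - T w) + lam * sum_m |w_m|."""
    n, M = len(T), len(T[0])
    w0, w = sum(y) / n, [0.0] * M
    for _ in range(n_iter):
        for m in range(M):
            # partial residual with column m left out
            r = [y[i] - w0 - sum(T[i][k] * w[k] for k in range(M) if k != m)
                 for i in range(n)]
            rho = sum(T[i][m] * r[i] for i in range(n))
            z = sum(T[i][m] ** 2 for i in range(n))
            w[m] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
        # the intercept is not penalised: plain least-squares update
        w0 = sum(y[i] - sum(T[i][k] * w[k] for k in range(M)) for i in range(n)) / n
    return w0, w
```

The threshold is `lam / 2` because the quadratic term in the criterion carries no factor of 1/2. Columns whose correlation with the partial residual stays below that threshold get a weight of exactly zero, which is how the penalty drops models from the ensemble.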

The base learners in the preceding sections of this article can be of any kind;
however, they are usually regression or decision trees. Each decision tree in the
ensemble partitions the input space using the product of indicator functions
of "simple" regions based on several input variables. A tree with K terminal
nodes defines a K-partition of the input space, where membership to a specific
node, say node k, can be determined by applying the conjunctive rule
$r_k(x) = \prod_{j=1}^{p} I(x_j \in s_{jk})$, where $I(\cdot)$ is the indicator
function. The regions $s_{jk}$ are intervals for continuous variables and subsets
of the levels for categorical variables. Therefore, a rule corresponds to a region
that is the intersection of half spaces defined by hyperplanes that are orthogonal
to the axes of the predictor variables.
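
As a small illustration of such a conjunctive rule, the sketch below evaluates a product of indicators where each region is either a half-open interval (continuous variable) or a set of levels (categorical variable). The representation of a rule as a dictionary and the name `make_rule` are assumptions made for this example only.

```python
def make_rule(conditions):
    """Build r(x) = prod_j I(x_j in s_j).

    conditions maps a variable name to either a (low, high) interval,
    read as low <= x_j < high, or a set of admissible levels.
    """
    def rule(x):
        for name, region in conditions.items():
            if isinstance(region, (set, frozenset)):
                inside = x[name] in region
            else:
                low, high = region
                inside = low <= x[name] < high
            if not inside:
                return 0        # one failed indicator makes the product zero
        return 1
    return rule

# I(x < 0) * I(z < 1), written with unbounded lower ends
rule_1 = make_rule({"x": (float("-inf"), 0.0), "z": (float("-inf"), 1.0)})
```

Calling `rule_1({"x": -1.0, "z": 0.5})` returns 1 and `rule_1({"x": 1.0, "z": 0.5})` returns 0.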

A regression tree with K terminal nodes can be written as

$$T(x) = \sum_{k=1}^{K} \beta_k r_k(x). \qquad (3)$$

Trees with many terminal nodes usually produce more complex rules and tree 
size is an important meta-parameter which we can control by maximum tree 
depth and cost pruning. 

Given a set of decision trees, rules can be extracted from each of these
trees to define a collection of conjunctive rules (Figure [TJ). A conjunctive rule
$r(x) = \prod_{j=1}^{p} I(x_j \in s_j)$ can also be expressed as a logic rule (also
called a Boolean expression or logic statement) involving only the $\wedge$ ("and")
operator. In general, a logic statement is constructed using the operators
$\wedge$ ("and"), $\vee$ ("or") and $^c$ ("not") and brackets. An example of a
simple logic rule is
$l(x) = [I(x_1 \in s_1) \vee I^c(x_2 \in s_2)] \wedge I(x_3 \in s_3)$.
Logic Regression ([23]) is an adaptive regression methodology that constructs
logic rules from binary input variables. Simple conjunctive rules that are learned
by the ISLE approach can be used as input variables to logic regression to combine
these rules. However, the representation of a logic rule in general is not unique,
and it can be shown that all logic rules can be expressed in disjunctive normal
form, where we only use $\vee$ combinations of $\wedge$ (not necessarily simple)
terms.
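
Rule extraction from a fitted tree amounts to collecting the split conditions along each root-to-node path. The sketch below does this for terminal nodes only, on a hypothetical toy tree stored as nested dictionaries; the tree, its leaf values and the function names are illustrative assumptions. Some rule ensemble variants also keep the rules of interior nodes, which is not shown here.

```python
def extract_rules(node, path=()):
    """Return one (conditions, leaf_value) pair per terminal node."""
    if "value" in node:
        return [(path, node["value"])]
    var, thr = node["var"], node["thr"]
    return (extract_rules(node["left"], path + ((var, "<", thr),)) +
            extract_rules(node["right"], path + ((var, ">=", thr),)))

def fires(conditions, x):
    """Conjunctive rule: 1 if every condition on the path holds, else 0."""
    return int(all(x[v] < t if op == "<" else x[v] >= t
                   for v, op, t in conditions))

def tree_predict(rules, x):
    """A tree is the sum of leaf values times the corresponding rules."""
    return sum(value * fires(conds, x) for conds, value in rules)

# Toy tree: split on x at 0, then on z at 1 in the left branch.
tree = {"var": "x", "thr": 0.0,
        "left": {"var": "z", "thr": 1.0,
                 "left": {"value": 20.0}, "right": {"value": 15.0}},
        "right": {"value": 10.0}}
rules = extract_rules(tree)
```

Because the terminal nodes partition the input space, exactly one of the extracted rules fires for any input, so `tree_predict` reproduces the prediction of the original tree.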

Let $R = (r_l(x_i))_{i=1,\,l=1}^{n,\,L}$ be the $n \times L$ matrix of rules for the
n observations by the L rules in the ensemble. The rulefit algorithm of Friedman &
Popescu p"3] uses the weights $(w_0, w = \{w_l\}_{l=1}^{L})$ that are estimated from

$$(\hat{w}_0, \hat{w}) = \arg\min_{w_0,\,w} \; (y - w_0 1_n - Rw)'(y - w_0 1_n - Rw) + \lambda \sum_{l=1}^{L} |w_l| \qquad (4)$$

in the final prediction model $F(x) = \hat{w}_0 + \sum_{l=1}^{L} \hat{w}_l r_l(x)$.

[Figure: rules extracted from a decision tree. Legible labels: "Rule 1: I(x<0)I(z<1)", "Rule 3: I(x>=0)"; node values y=20, y=15, y=10.]

Rule ensembles are shown to produce improved accuracy over traditional 
ensemble methods like random forests, bagging, boosting and ISLE ([12], [12] 
and [27]). 

2 Soft Rules from Hard Rules 

Soft rules, which take values in [0, 1], are obtained by replacing each hard rule
r(x) with a logistic function of the form

$$s(x) = \frac{1}{1 + \exp(-g(x; \theta))}.$$

The value of a soft rule s(x) can be viewed as the probability that the rule
fires for x.
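
For intuition, the sketch below evaluates a soft rule for a single input variable, assuming g is a quadratic polynomial in that variable. The coefficient values in the example are made up for illustration and are not estimates from any data set; `sigmoid` and `soft_rule` are hypothetical names.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def soft_rule(theta):
    """Soft rule s(x) with g(x; theta) = b0 + b1*x + b2*x**2."""
    b0, b1, b2 = theta
    return lambda x: sigmoid(b0 + b1 * x + b2 * x * x)

# A smooth stand-in for the hard rule I(x < 0); coefficients are illustrative.
s = soft_rule((0.0, -4.0, 0.0))
```

Here `s(-2.0)` is close to 1, `s(2.0)` is close to 0, and `s(0.0)` equals 0.5, so the value changes smoothly across the boundary of the hard rule instead of jumping.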

In this paper, $g(x; \theta)$ includes additive terms of order two without any
interaction terms in the variables which were used explicitly in the construction
of the rule r(x). The model is built using best subsets regression, selecting the
subset of terms for which the AIC is minimized. The coefficients $\theta$ of the
function $g(x; \theta)$ are to be estimated from the examples of x and r(x) in the
training data.

A common problem with logistic regression is the problem of (perfect)
separation (p~Zj), which occurs when the response variable can be perfectly
separated by one or a linear combination of a few explanatory variables. When
this is the case, the likelihood function becomes monotone and non-finite
estimates of the coefficients are produced. In order to deal with the problem of
separation, Firth's bias corrected likelihood approach ([7]) has been recommended
([17]). The bias corrected likelihood approach to logistic regression is
guaranteed to produce finite estimates and standard errors.

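Maximum likelihood estimators of the coefficients $\theta$ are obtained as the
solution to the score equation

$$\partial \log L(\theta) / \partial \theta = U(\theta) = 0$$

where $L(\theta)$ is the likelihood function. Firth's bias corrected likelihood
uses a modified likelihood function

$$L^*(\theta) = L(\theta)\,|i(\theta)|^{1/2}$$

where $i(\theta)$ is the Fisher information matrix and $|i(\theta)|^{1/2}$ is the
Jeffreys ([21]) invariant prior.

Using the modified likelihood function, the score function for the logistic
model is given by $U^*(\theta) = (U^*(\theta_1), U^*(\theta_2), \ldots, U^*(\theta_k))'$
where

$$U^*(\theta_j) = \sum_{i=1}^{n} \left\{ r(x_i) - s(x_i; \theta) + h_i \left( \tfrac{1}{2} - s(x_i; \theta) \right) \right\} \frac{\partial g(x_i; \theta)}{\partial \theta_j}$$

for $j = 1, 2, \ldots, k$, where k is the number of coefficients in $g(x; \theta)$
and $s(x_i; \theta) = 1 / (1 + \exp(-g(x_i; \theta)))$ is the fitted probability.
Here, $h_i$ for $i = 1, 2, \ldots, n$ are the i-th diagonal elements of the hat
matrix

$$H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}$$

where X is the design matrix of the terms in $g(x; \theta)$ and
$W = \mathrm{diag}\{s(x_i; \theta)(1 - s(x_i; \theta))\}$. Bias corrected
estimates can be obtained in an iterative fashion using

$$\theta^{t+1} = \theta^{t} + i^{-1}(\theta^{t})\, U^*(\theta^{t}).$$

Our programs utilize the "brglm" package in R ([24]), which fits binomial-response
generalized linear models using bias reduction.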
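
The iteration can be illustrated with a small pure-Python sketch for the simplest case of an intercept and one slope, where the 2 x 2 information matrix is inverted in closed form. This is an independent illustration of the Firth-type update, not the algorithm implemented in "brglm"; the function names are hypothetical, the fitted probability is called `p` in the code, and the sketch omits safeguards that practical implementations usually add (step-size control, convergence checks).

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def firth_logistic(x, y, n_iter=100):
    """Firth-type fit of P(y = 1) = sigmoid(a + b*x); returns (a, b)."""
    a = b = 0.0
    for _ in range(n_iter):
        p = [sigmoid(a + b * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information X'WX for the design rows (1, x_i)
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det   # inverse information
        # hat diagonals h_i = w_i * (1, x_i) V (1, x_i)'
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi) for wi, xi in zip(w, x)]
        # modified score contributions y_i - p_i + h_i (1/2 - p_i)
        u = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(u)
        u1 = sum(ui * xi for ui, xi in zip(u, x))
        a += v00 * u0 + v01 * u1
        b += v01 * u0 + v11 * u1
    return a, b
```

On perfectly separated data such as `x = [-2, -1, 1, 2]` with `y = [0, 0, 1, 1]`, ordinary maximum likelihood would push the slope towards infinity, whereas this update settles at a finite slope. That is the property the soft rule conversion relies on, since hard rules are deterministic functions of their inputs.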

In [5] hard rules are replaced with products of univariate logistic functions
to build models called tree-structured smooth transition regression models,
which generalize regression tree models. A similar model, called soft decision
trees, is introduced in [20]. These authors use logistic functions to calculate
gating probabilities at each node in a hierarchical structure where the children
of each node are selected with a certain probability; the terminal nodes of the
incrementally learned trees are represented as a product of logistic functions. In
[6] tree splits are softened using simulated annealing. Fuzzy decision trees were
presented by [26]. Perhaps these models can be used to generate soft rules.
However, in this paper, we use a simpler approach where we utilize logistic
functions only at the terminal nodes of the trees built by the CART algorithm.

In Figure 1 we present simple hard rules and the corresponding soft rules
estimated from the training data. It is clear that the soft rules provide a smooth
approximation to the hard rules.

[Figure 1 image; panel label: N=200]

Figure 1: Hard rules and the corresponding soft rules estimated from the
training data. It is clear that the soft rules provide a smooth approximation to
the hard rules.

Let $R = (r_l(x_i))_{i=1,\,l=1}^{n,\,L}$ be the $n \times L$ matrix of L rules for
the n observations in the training sample. Letting $s_l(x; \theta_l)$ be the soft
rule corresponding to the l-th hard rule, define
$S = (s_l(x_i))_{i=1,\,l=1}^{n,\,L}$ as the $n \times L$ matrix of L soft rules for
the n observations.

The weights for the soft rules can be estimated from the LASSO:

$$(\hat{w}_0, \hat{w}) = \arg\min_{w_0,\,w} \; (y - w_0 1_n - Sw)'(y - w_0 1_n - Sw) + \lambda \sum_{l=1}^{L} |w_l|. \qquad (5)$$

This leads to the final soft rule ensemble prediction model
$F(x) = \hat{w}_0 + \sum_{l=1}^{L} \hat{w}_l s_l(x)$.

The only additional step for building our soft rule ensembles is the hard to
soft rule conversion step. This makes our algorithm slower than the rulefit
algorithm ([12]), at times 10 times slower. However, we have implemented our
soft rules algorithm in the R language and successfully applied it to several high
dimensional problems. We expect that a faster implementation is possible if
the code is migrated to a faster programming language. In addition, parts or
the whole of the soft rule generation process can be accomplished by parallel
processing.

For completeness and easy reference, we summarize the steps for soft rule 
ensemble generation and fitting: 

1. Use ISLE algorithm to generate M trees: T(X). 

2. Extract hard rules from T(X): R(X). 

3. Convert hard rules to soft rules: S(X). 

4. Obtain soft rule weights by LASSO. 

We should note that no additional meta-parameters are introduced to pro- 
duce soft rules from hard rules. We set these meta-parameters along the lines 
of the recommendations in previous work ( [TJ] , [T3] and [37] ) . 

There are several fast algorithms that can accomplish the LASSO post-processing
for large datasets (large n or p). The recent pathwise coordinate descent
algorithm ([5]), implemented in the "glmnet" package for R ((TH])), provides the
entire path of solutions. We have used "glmnet" in our illustrations in the next
section; the value of the sparsity parameter was set by minimizing the mean
cross-validated error. Due to the well known selection property of the lasso
penalty, only 10%-20% of the rules are retained in the final model after the
post-processing step.

3 Illustrations 

In this section we compare the soft rule and hard rule ensembles.
The prediction accuracy is measured by the cross validated correlation or the
mean square error (MSE) between the predicted and true target variable values
for regression examples, and by the cross validated area under the ROC curve for
classification examples.
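
For reference, the three accuracy measures can be computed as in the following sketch. The function names are hypothetical, and the AUC is computed through its rank (Mann-Whitney) formulation with ties counted as one half; this is one common way to obtain it and is not necessarily the routine used for the reported results.

```python
import math

def pearson(a, b):
    """Pearson correlation between predicted and true values."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def mse(a, b):
    """Mean squared error between predicted and true values."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def auc(scores, labels):
    """Area under the ROC curve: P(score of a positive > score of a negative)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In a cross-validation setting these functions would be applied to the held-out predictions of each fold, or to the pooled held-out predictions.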

The number of trees to be generated by the ISLE algorithm was set to
M = 400; each tree used 30% of the individuals and 10% of the input variables,
randomly selected from the training set. Larger trees allow more complex rules
to be produced, and therefore controlling tree size controls the maximum rule
complexity. A choice of this parameter can be based on prior knowledge or on
experiments with different values. We have tried different tree depths to
obtain models with varying complexity. For most of the examples, accuracies
are reported for each tree depth. In Example 3.2, models with differing rule
depths were trained and only the models with the best cross validated
performances are reported. In addition, the memory parameter $\nu$ of the ISLE
ensemble generation algorithm is set to zero in all the following examples.

Example 3.1. (Boston Housing Data, Regression) In order to compare the
performance of prediction models based on hard and soft rules we use the famous
benchmark "Boston Housing" data set (Mty ). This data set includes n = 506
observations and p = 14 variables. The response variable is the median house
value, which is predicted from the remaining 13 variables in the data set. The
10-fold cross validated accuracies are displayed in Table 1. Soft rules improve
the accuracy slightly at tree sizes 2 and 5, while hard rules are slightly better
at tree size 4.





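| tree size | accuracy (hard) | accuracy (soft) | rmse (hard) | rmse (soft) |
|-----------|-----------------|-----------------|-------------|-------------|
| 2         | 0.91            | 0.92            | 3.78        | 3.60        |
| 3         | 0.93            | 0.93            | 3.40        | 3.42        |
| 4         | 0.94            | 0.93            | 3.18        | 3.29        |
| 5         | 0.93            | 0.94            | 3.33        | 3.18        |

Table 1: The 10-fold cross validated prediction accuracies, as measured by the
correlation of the true and predicted values, and the corresponding root mean
squared errors (rmse) for the "Boston housing data".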

As observed from the previous example, for continuous input variables soft
rule ensembles might have better prediction accuracy than their hard rule
counterparts. However, for problems with only categorical or discrete input
variables, we do not expect to see the same improvements. The following
example uses only discrete SNP markers (biallelic marker values coded as -1, 0
and 1) as input variables, and hard rules and soft rules give approximately the
same accuracies.

Example 3.2. (Plant Breeding Data, Regression) In our second example we
analyze plant breeding data and compare the predictive performance of hard rules
with soft rules. In both cases, the objective is to predict a quantitative trait
(observed performance) using molecular marker data providing information about
the genotypes of the plants. Prediction of phenotypes using numerous molecular
markers at the same time is called genomic selection and has lately received a
lot of attention in the plant and animal breeding communities. Rule ensembles
used with genetic marker data implicitly capture epistasis (interaction between
markers) in a high dimensional context while retaining interpretability of the
model.

The first data set (Bay x Sha) contains measurements on flowering time
under short day length, and dry matter under non limiting or limiting conditions,
from 422 recombinant inbred lines from a biparental population of Arabidopsis
thaliana plants from 2 ecotypes, Bay-0 and Shahdara, genotyped with 69 SSRs.
The data are available from the Study of the Natural Variation of A. thaliana
website <W- W).

The second data set (Wheat CIMMYT) is composed of 599 spring wheat
inbred lines evaluated for yield in 4 different target environments (YLD1-YLD4).
1279 DArT markers were available for the 599 lines in the study (J$). The
results are displayed in Table 2.



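cimmyt

| tree size | accuracy (hard) | accuracy (soft) | rmse (hard) | rmse (soft) |
|-----------|-----------------|-----------------|-------------|-------------|
| 2         | 0.50            | 0.51            | 0.87        | 0.86        |
|           | 0.41            | 0.41            | 0.92        | 0.92        |
|           | 0.36            | 0.36            | 0.96        | 0.96        |
|           | 0.42            | 0.42            | 0.92        | 0.92        |
| 3         | 0.52            | 0.52            | 0.86        | 0.86        |
|           | 0.42            | 0.43            | 0.92        | 0.91        |
|           | 0.40            | 0.40            | 0.93        | 0.93        |
|           | 0.40            | 0.41            | 0.94        | 0.93        |

Bay Sha

| tree size | accuracy (hard) | accuracy (soft) | rmse (hard) | rmse (soft) |
|-----------|-----------------|-----------------|-------------|-------------|
| 2         | 0.86            | 0.86            | 4.66        | 4.64        |
|           | 0.67            | 0.67            | 2.17        | 2.18        |
|           | 0.37            | 0.37            | 1.18        | 1.18        |
| 3         | 0.86            | 0.85            | 4.75        | 4.77        |
|           | 0.62            | 0.62            | 2.30        | 2.30        |
|           | 0.33            | 0.33            | 1.22        | 1.22        |

Table 2: Accuracies of hard and soft rule ensembles are compared by the cross
validated Pearson's correlation coefficients between the estimated and true
values. For each data set we have trained models based on rules of depth 1 to 6;
the results from the models with the best cross validated accuracies are
reported. The results show no significant difference between hard and soft rules.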

When the task is classification, no significant improvement is achieved by
preferring soft rules over hard rules. To compare soft and hard rule ensembles
in the context of classification, we use the Arcene and Madelon datasets
downloaded from the UCI Machine Learning Repository.

Example 3.3. (Arcene data, Classification) The task in the Arcene data is to
classify patterns as normal or cancer based on mass-spectrometric data. There
were 7000 initial input variables; 3000 probe input variables were added to
increase the difficulty of the problem. There were 200 individuals in the data
set. Areas under the ROC curves for soft and hard rule ensemble models based on
rules with depths 2 to 8 are compared in Table 3 (left).

Example 3.4. (Madelon data, Classification) Madelon contains data points
grouped in 32 clusters placed on the vertices of a five dimensional hypercube and
randomly labeled +1 or -1. There were 500 continuous input variables, 480 of
these were probes. There were 2600 labeled examples. Areas under the ROC
curves for rules with depths 2 to 7 are compared in Table 3 (right).



Arcene

| tree size | hard | soft |
|-----------|------|------|
| 2         | 0.82 | 0.82 |
| 3         | 0.87 | 0.82 |
| 4         | 0.79 | 0.80 |

Madelon

| tree size | hard | soft |
|-----------|------|------|
| 2         | 0.83 | 0.87 |
| 3         | 0.86 | 0.88 |
| 4         | 0.90 | 0.92 |

Table 3: 10-fold cross validated areas under the ROC curves for soft and hard
rule ensemble models based on rules with depths 2 to 7 for the Arcene (left) and
Madelon (right) data sets. We do not observe any significant difference between
soft and hard rules.

When both response and input variables are continuous, soft rules perform 
better than hard rules. In the last two examples, we compare our models via 
mean squared errors. 

Example 3.5. (Simulated Data, Regression) This regression problem is
described in Friedman ([11]) and Breiman ([3]). Elements of the input vector
$x = (x_1, x_2, \ldots, x_{10})$ are generated independently from the uniform(0,1)
distribution; only 5 out of these 10 are actually used to calculate the target
variable y as

$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + e$$

where $e \sim N(0, 1)$. 1000 independent realizations of (x, y) constitute the
training data. Mean squared errors for models are calculated on a test sample of
the same size. The boxplots in the left panel of Figure 2 summarize the
prediction accuracies for soft and hard rules over 30 replications of the
experiment.
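
A data generator for this benchmark can be sketched as follows. It assumes the usual form of the problem, with ten independent uniform(0,1) inputs of which the first five enter the response, and standard normal noise; the function name and the seeding scheme are illustrative choices.

```python
import math
import random

def friedman1(n, seed=0):
    """Simulate n realizations (x, y) of the first benchmark problem."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.random() for _ in range(10)]      # x_6, ..., x_10 are pure noise
        mean = (10.0 * math.sin(math.pi * x[0] * x[1])
                + 20.0 * (x[2] - 0.5) ** 2
                + 10.0 * x[3]
                + 5.0 * x[4])
        X.append(x)
        y.append(mean + rng.gauss(0.0, 1.0))       # e ~ N(0, 1)
    return X, y

X_train, y_train = friedman1(1000, seed=1)
X_test, y_test = friedman1(1000, seed=2)
```

Using different seeds for the training and test samples keeps the two samples independent, and repeating the pair of calls with fresh seeds gives the replications of the experiment.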

Example 3.6. (Simulated Data, Regression) Another problem described in
Friedman ([11]) and Breiman ([3]). Inputs are 4 independent variables uniformly
distributed over the ranges $0 \le x_1 \le 100$, $40\pi \le x_2 \le 560\pi$,
$0 \le x_3 \le 1$, $1 \le x_4 \le 11$. The outputs are created according to the
formula $y = (x_1^2 + (x_2 x_3 - 1/(x_2 x_4))^2)^{0.5} + e$ where e is
N(0, sd = 125). 1000 independent realizations of (x, y) constitute the training
data. Mean squared errors for models are calculated on a test sample of the same
size. The boxplots in the right panel of Figure 2 summarize the prediction
accuracies for soft and hard rules.
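
The second benchmark can be simulated in the same way. The sketch assumes the usual statement of this problem, in which the first input enters squared under the square root and the noise has standard deviation 125; the function name and seeds are illustrative.

```python
import math
import random

def friedman2(n, seed=0):
    """Simulate n realizations (x, y) of the second benchmark problem."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1 = rng.uniform(0.0, 100.0)
        x2 = rng.uniform(40.0 * math.pi, 560.0 * math.pi)
        x3 = rng.uniform(0.0, 1.0)
        x4 = rng.uniform(1.0, 11.0)
        mean = math.sqrt(x1 ** 2 + (x2 * x3 - 1.0 / (x2 * x4)) ** 2)
        X.append([x1, x2, x3, x4])
        y.append(mean + rng.gauss(0.0, 125.0))     # e ~ N(0, sd = 125)
    return X, y

X2_train, y2_train = friedman2(1000, seed=1)
X2_test, y2_test = friedman2(1000, seed=2)
```

The inputs live on very different scales here, so this problem is a useful check that a method does not depend on the inputs being standardized.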

Figure 2: The boxplots summarize the prediction accuracies for soft and hard
rules over 30 replications of the experiments described in Examples 3.5 and 3.6.
In terms of mean squared errors the soft rule ensembles perform better than
the corresponding hard rule ensembles for all values of tree depth. The hard
rule ensemble models are denoted by "str" and soft rule ensemble models are
denoted by "soft". The numbers next to these acronyms are the depths of the
corresponding hard rules.

4 Discussions



As our examples in the previous section suggest, the best case for soft rules
is when both input and output variables are continuous. For data sets with
mixed input variables it might be better to use two sets of rules (hard rules for
categorical variables and soft rules for numerical variables) and combine them in
a supervised learning model with a group lasso penalty ([29]), or perhaps these
rules can be combined in a multiple kernel model ([2], [13]).

The hard rules or the soft rules can be used as input variables in any su- 
pervised or unsupervised learning problem. In [1], several promising hard rule 
ensemble methods were proposed for supervised, semi-supervised and unsuper- 
vised learning. For instance, the model weights can be obtained using partial 
least squares regression. A similarity matrix obtained from hard rules can be 
used as a learned kernel matrix in Gaussian process regression. It is straight- 
forward to use these and similar methods with soft rules. 

Several model interpretation tools have been developed for use with rule
ensembles and the ISLE models. These include local and global rule, variable and
interaction importance measures and partial dependence functions ([13]). We
can use the same tools to interpret the soft rule ensemble model. For example,
the absolute values of the standardized coefficients can be used to evaluate
the importance of rules. A measure of importance for each input variable can
be obtained as the sum of the importances of the rules that involve that variable.
The interaction importance measures and the partial dependence functions
described in [13] are general measures and they also naturally apply to our soft
rule ensembles.
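
The rule and variable importance measures just described can be sketched as below. The sketch assumes that a standardized coefficient is the fitted weight multiplied by the standard deviation of the corresponding rule over the training sample, and it uses the plain sum over rules stated in the text; the function names and the toy numbers are illustrative.

```python
def rule_importances(weights, rule_sds):
    """Absolute standardized coefficient |w_l| * sd(rule_l) for each rule."""
    return [abs(w) * sd for w, sd in zip(weights, rule_sds)]

def variable_importances(importances, rule_vars):
    """Sum the importances of all rules that involve each variable."""
    totals = {}
    for imp, names in zip(importances, rule_vars):
        for name in names:
            totals[name] = totals.get(name, 0.0) + imp
    return totals

# Toy ensemble of three rules; the third was dropped by the lasso (weight 0).
imps = rule_importances([2.0, -1.0, 0.0], [0.5, 1.0, 0.3])
var_imps = variable_importances(imps, [("x", "z"), ("x",), ("z",)])
```

In this toy case the rule importances are 1.0, 1.0 and 0.0, so the variable x collects 2.0 and z collects 1.0. Rules removed by the lasso contribute nothing to any variable.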

The ensemble approaches are also a remedy for memory problems faced in 
analyzing big datasets. By adjusting the sampling scheme in the ISLE algorithm 
we were able to analyze large data sets which have thousands of variables and 
tens of thousands of individuals by only loading fractions of the data into the 
memory at a time. 